First, we load the packages and the data. Note that some columns have to be converted to factor every time the file is loaded in a new qmd: a CSV file does not store column types, so read.csv() infers them afresh on each read (text columns come back as chr).
There are 1400 rows (patients) in the cleaned data.
library(tidyverse)
library(plotly)
library(rafalib)
library(kableExtra)
library(gridExtra)
library(knitr)
data <- read.csv("../data/data_clean.csv")
# double check variables have no missing values
# colSums(is.na(data))
data <- data |>
  mutate(
    tumor_stage = as.factor(tumor_stage),
    her2_status = as.factor(her2_status),
    er_status = as.factor(er_status),
    pr_status = as.factor(pr_status),
    her2_status_measured_by_snp6 = as.factor(her2_status_measured_by_snp6),
    death_from_cancer = as.factor(death_from_cancer),
    neoplasm_histologic_grade = as.factor(neoplasm_histologic_grade),
    overall_survival = as.factor(overall_survival)
  )
# glimpse(data)
nrow(data)
[1] 1400
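One way to avoid repeating the conversion in every qmd is to declare the factor columns when the file is read. The sketch below is only an illustration: it writes a small made-up CSV to a temporary file (the rows and the names `toy_path` and `toy` are invented, not the study data) and uses the `colClasses` argument of `read.csv()` to read the listed columns as factors directly.

```r
# Toy stand-in for data_clean.csv; the values are invented for illustration.
toy_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    age_at_diagnosis = c(45.2, 61.0, 70.3),
    tumor_stage = c(1, 2, 2),
    er_status = c("Positive", "Negative", "Positive")
  ),
  toy_path,
  row.names = FALSE
)

# Columns named in colClasses are read as factors; the others are inferred.
toy <- read.csv(
  toy_path,
  colClasses = c(tumor_stage = "factor", er_status = "factor")
)
str(toy)
```

With the real file, the named vector would list the same eight columns that are passed to `as.factor()` above.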
The median age at diagnosis is 61.13 years and the mean is 60.61 years. The ages follow a roughly bell-shaped distribution that is very slightly skewed to the left (the mean sits just below the median).
summary(data$age_at_diagnosis)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  21.93   51.00   61.13   60.61   69.89   96.29
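The skew direction read off the summary can also be checked numerically. The sketch below is a minimal illustration, not part of the original analysis: `skew_check()` is a hypothetical helper that returns the mean minus the median together with the moment-based skewness coefficient, and it is run on a small invented vector rather than the study data. For the ages above, the mean minus the median is 60.61 - 61.13 = -0.52, a small negative value in line with a slight left skew (the mean-versus-median comparison is a rule of thumb, not a guarantee).

```r
# Hypothetical helper (not from the original analysis).
skew_check <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  c(
    mean_minus_median = mean(x) - median(x),
    # third central moment divided by the cubed (population) standard deviation
    skewness = mean(centred^3) / mean(centred^2)^1.5
  )
}

# Invented example: one large value pulls the right tail out,
# so both measures come out positive.
right_skewed <- c(1, 1, 2, 2, 3, 20)
skew_check(right_skewed)
```

With the real data the call would be `skew_check(data$age_at_diagnosis)`; negative values indicate a left skew and positive values a right skew.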
age_hist <- ggplot(data, aes(x = age_at_diagnosis)) +
  geom_histogram(
    binwidth = 5,
    boundary = 25,
    color = "black",
    fill = "steelblue",
    aes(text = paste0("Age range: ", after_stat(xmin), "-", after_stat(xmax),
                      "<br>Count: ", after_stat(count)))
  ) +
  labs(
    title = "Distribution of Age at Diagnosis",
    x = "Age at Diagnosis (years)",
    y = "Count"
  )
ggplotly(age_hist, tooltip = "text")
The bar chart below shows the distribution of tumour stages (0-4). Most patients (1275 of 1400, about 91%) had stage 1 or stage 2 breast cancer at the time of diagnosis.
summary(data$tumor_stage)
  0   1   2   3   4
  3 475 800 113   9
tumor_stage_graph <- ggplot(data, aes(x = tumor_stage)) +
  geom_bar(color = "black", fill = "steelblue") +
  labs(
    title = "Tumour Stages of Patients",
    x = "Tumour Stage (0-4)",
    y = "Count"
  )
ggplotly(tumor_stage_graph)
Tumour sizes are heavily right-skewed, with quite a few high outliers (the largest tumour measured 180 mm). The median size was 22 mm, and the mean was 25.85 mm.
summary(data$tumor_size)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   17.00   22.00   25.85   30.00  180.00
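To make "outliers" concrete: by default `geom_boxplot()` draws as separate points any values lying more than 1.5 times the interquartile range beyond the quartiles. The worked example below applies that rule to the quartiles printed above (17 and 30). The variable names are illustrative, and the quartiles computed by the plotting code may differ slightly from the rounded `summary()` values.

```r
q1 <- 17        # 1st quartile from summary(data$tumor_size)
q3 <- 30        # 3rd quartile from summary(data$tumor_size)
iqr <- q3 - q1  # interquartile range: 13

# Values outside these fences are drawn as individual outlier points.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)
```

Under this rule, tumours larger than 49.5 mm are plotted as outliers. The lower fence is negative (-2.5 mm), so no small tumour can be flagged, which matches the right skew described above.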
tumor_size_boxplot <- ggplot(data, aes(y = tumor_size)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Tumour Sizes of Patients",
    y = "Tumour Size (mm)"
  )
ggplotly(tumor_size_boxplot)
The number of positive lymph nodes per patient is heavily skewed to the right: the median is 0, so at least half of the patients had no positive nodes, while the mean is 1.892.
summary(data$lymph_nodes_examined_positive)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   1.892   2.000  41.000
lymph_node_hist <- ggplot(data, aes(x = lymph_nodes_examined_positive)) +
  geom_histogram(
    binwidth = 3,
    color = "black",
    fill = "steelblue",
    aes(text = paste0("Lymph node range: ", after_stat(xmin), "-", after_stat(xmax),
                      "<br>Count: ", after_stat(count)))
  ) +
  labs(
    title = "Distribution of No. of Positive Lymph Nodes",
    x = "Number of Lymph nodes examined positive",
    y = "Count"
  )
ggplotly(lymph_node_hist)
The majority of patients had ER-positive rather than ER-negative breast cancer.
A majority also had HER2-negative rather than HER2-positive breast cancer.
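The bar charts show these majorities only visually. A minimal sketch of how the exact counts and shares could be tabulated is given below. The `toy_status` vector is invented for illustration (it is not the study data); with the real data the same calls would be applied to `data$er_status` and `data$her2_status`.

```r
# Invented status values, standing in for a real status column.
toy_status <- factor(c("Positive", "Positive", "Negative", "Positive"))

counts <- table(toy_status)                    # number of patients per level
shares <- round(100 * prop.table(counts), 1)   # percentage per level
counts
shares
```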
er_barplot <- ggplot(data, aes(x = er_status)) +
  geom_bar(color = "black", fill = "steelblue") +
  labs(x = "ER Status", y = "Count")
# both panel titles are written into the single title string below
her2_barplot <- ggplot(data, aes(x = her2_status)) +
  geom_bar(color = "black", fill = "steelblue") +
  labs(x = "HER2 Status", y = "Count") +
  ggtitle("Distribution of ER Status Distribution of HER2 Status")
subplot(
  ggplotly(er_barplot),
  ggplotly(her2_barplot),
  nrows = 1,
  shareY = TRUE,
  titleX = TRUE
) |> layout(yaxis = list(range = c(0, 1300))) # so that graph doesn't get cut off
By the end of the study, 790 patients had not survived, while 610 had survived.
summary(data$overall_survival)
  0   1
790 610
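As a quick worked check, the snippet below converts these counts to percentages of all patients. The counts are copied from the output above; the labels `died` and `survived` are illustrative names that follow the reading of levels 0 and 1 given in the text.

```r
# Counts copied from summary(data$overall_survival): level 0, then level 1.
survival_counts <- c(died = 790, survived = 610)
total <- sum(survival_counts)

# Share of each outcome, in percent, rounded to one decimal place.
survival_pct <- round(100 * survival_counts / total, 1)
survival_pct
```

That is, about 56.4% of the 1400 patients did not survive and about 43.6% did.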
The histogram below shows the distribution of overall survival times. The data are skewed to the right, with more patients having shorter survival times. The median survival time was 117.6 months, and the mean was 127.8 months.
summary(data$overall_survival_months)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.1    61.9   117.6   127.8   189.1   351.0
survival_hist <- ggplot(data, aes(x = overall_survival_months)) +
  geom_histogram(bins = 20, boundary = 0, fill = "steelblue", colour = "black")
ggplotly(survival_hist)